Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynparquet: remove rowWriter #561

Closed
wants to merge 1 commit into from
Closed

dynparquet: remove rowWriter #561

wants to merge 1 commit into from

Conversation

asubiotto
Copy link
Member

The functionality this gave us is already implemented by the parquet library. This also increases compaction efficiency since we can use parquet.Copy.

This gives us a ~20% reduction in compaction memory usage when compacting replayed prod data:

goos: darwin
goarch: arm64
pkg: github.com/polarsignals/frostdb
          │ benchnew_mem │          benchcopy_mem           │
          │    sec/op    │   sec/op    vs base              │
Replay-12     23.21 ± 1%   21.15 ± 0%  -8.87% (p=0.002 n=6)

          │ benchnew_mem │            benchcopy_mem            │
          │     B/op     │     B/op      vs base               │
Replay-12   42.07Gi ± 4%   34.53Gi ± 4%  -17.92% (p=0.002 n=6)

          │ benchnew_mem │           benchcopy_mem           │
          │  allocs/op   │  allocs/op   vs base              │
Replay-12    125.8M ± 1%   132.8M ± 0%  +5.57% (p=0.002 n=6)

However, it seems like queries suffer:

goos: darwin
goarch: arm64
pkg: github.com/polarsignals/frostdb
                │ benchswquery │            benchcopyquery             │
                │    sec/op    │    sec/op      vs base                │
Query/Types-12    7.994m ±  2%   35.582m ±  4%  +345.11% (p=0.002 n=6)
Query/Labels-12   348.8µ ± 10%    340.9µ ±  1%    -2.26% (p=0.002 n=6)
Query/Values-12   1.961m ±  1%   19.438m ±  2%  +891.00% (p=0.002 n=6)
Query/Merge-12    63.31m ±  3%   266.99m ±  4%  +321.68% (p=0.002 n=6)
Query/Range-12    33.21m ±  2%   252.21m ±  2%  +659.45% (p=0.002 n=6)
Query/Filter-12   8.020m ±  1%   37.202m ± 15%  +363.89% (p=0.002 n=6)
geomean           6.722m          28.97m        +330.94%

                │ benchswquery  │            benchcopyquery            │
                │    B/msec     │    B/msec      vs base               │
Query/Filter-12   18.168Mi ± 1%   3.759Mi ± 13%  -79.31% (p=0.002 n=6)

I'm really not sure why. It seems like row groups are correctly sized and in fact we read fewer row groups with this approach. Still need to investigate.

The functionality this gave us is already implemented by the parquet library.
This also increases compaction efficiency since we can use parquet.Copy.
@asubiotto asubiotto requested a review from thorfour October 11, 2023 14:05
Copy link
Contributor

@thorfour thorfour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Love it as long as we can fix why queries are so much worse suddenly

@asubiotto
Copy link
Member Author

Looks like it's because the RowWriterTo fast path in CopyRows has a bug when using a merged row group as input. Seems like the output rows are not sorted. Not using the fast path fixes the sorting/query latency but doesn't give us any memory benefits. Leaving this aside for now.

@asubiotto asubiotto closed this Nov 1, 2023
@asubiotto asubiotto deleted the alfonso-pqcopy branch November 1, 2023 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants